# Python Practice 8

In [16]:
import numpy as np
import matplotlib.pyplot as plt

Load the `heights.csv` file with the `np.loadtxt()` function. Be sure to pass the correct
`delimiter` argument to `np.loadtxt()`.
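
As a minimal sketch of how the `delimiter` argument works, the example below parses a small in-memory CSV instead of the real file. The values and the comma separator are assumptions made for illustration only; check `heights.csv` to see which separator it actually uses.

```python
import io
import numpy as np

# Hypothetical stand-in for a CSV file: two rows of comma-separated values
fake_csv = io.StringIO("170.0,165.5,180.2\n1.75,160.0,172.3")

# delimiter="," tells np.loadtxt to split each line on commas
demo = np.loadtxt(fake_csv, delimiter=",")

print(demo.shape)  # (2, 3)
```

If the delimiter does not match the file, `np.loadtxt()` cannot split the lines into numbers and raises an error.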

In [None]:
heights = # Put your code here

Let's plot the distribution of the data with the `plt.hist()` function. Since we have a 2D array,
we might need to convert it to a 1D array first by calling the `flatten()` **method** of the array.

Generally speaking, a method is a function that belongs to a class.
It acts directly on the object it is called on (in our case, a NumPy array).
You can think of the following analogy:

```python
def mean(array):
    return np.mean(array)
```
and
```python
array.mean()
```
are equivalent: both return the mean of the array's elements. The first is a function that takes an array as an argument, while the second is a method that belongs to the array class.
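
To make the analogy concrete, here is a small sketch on a made-up 2D array (the values are illustrative, not taken from `heights.csv`). It shows the `flatten()` method and checks that the method and function forms of `mean` agree.

```python
import numpy as np

# Small made-up 2D array, used only for illustration
demo = np.array([[170.0, 165.0], [180.0, 175.0]])

# Method call: returns a new 1D array with the same elements
flat = demo.flatten()
print(flat.shape)  # (4,)

# The method form and the function form give the same result
print(demo.mean() == np.mean(demo))  # True
```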

In [1]:
# Put your code here

We can clearly see that something is wrong with the data. Let's try to find out
why.

First, let's look at the exact values of those odd heights. To do that, we
select all the heights that are less than 10 (for instance) using the
`heights[heights < 10]` syntax.
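
The selection syntax can be tried on a toy array first. This is only a sketch with made-up values; the result of selecting from a 2D array with a boolean condition is always a 1D array.

```python
import numpy as np

# Made-up heights: most in centimeters, two suspiciously small values
demo = np.array([[172.0, 1.68, 181.0], [1.75, 165.0, 158.0]])

# Keep only the elements for which the condition is True
suspicious = demo[demo < 10]

print(suspicious)  # [1.68 1.75]
```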

In [3]:
# Put your code here

It turns out that these values are given in meters rather than centimeters.
To fix this, let's write a function that checks whether each height is below a threshold (3 meters by default)
and, if so, multiplies it by 100 to convert it to centimeters.

In [4]:
def fix_height_units(original_heights, threshold=3.0):
    heights = original_heights.copy() # We don't want to modify the original array, so we make a copy
    # Put your code here
    return heights

We can also use NumPy boolean indexing (a form of advanced indexing) to select all the heights that are less than 3 meters
and multiply them by 100. In this example we can use the `heights[heights < 3] = heights[heights < 3] * 100` syntax.
Here `heights < 3` returns a boolean array of the same shape as the original array.
Each element of this array is either `True` or `False`, depending on whether the condition is met.
We then use this boolean array to select only the elements that are less than 3 meters and multiply them by 100.
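
The intermediate boolean array can be inspected directly. The sketch below uses a made-up 1D array and a hypothetical variable name `mask` to show each step separately.

```python
import numpy as np

# Made-up heights: two values in meters, two in centimeters
demo = np.array([1.5, 172.0, 1.75, 165.0])

# Step 1: the comparison yields a boolean array of the same shape
mask = demo < 3
print(mask.tolist())  # [True, False, True, False]

# Step 2: assign back only to the positions where the mask is True
demo[mask] = demo[mask] * 100
print(demo.tolist())  # [150.0, 172.0, 175.0, 165.0]
```

Storing the mask in a variable also avoids evaluating the comparison twice.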

In [None]:
def fix_height_units_vectorized(original_heights, threshold=3.0):
    heights = original_heights.copy() # We don't want to modify the original array, so we make a copy
    heights[heights < threshold] = heights[heights < threshold] * 100
    return heights

In [None]:
fixed_heights = fix_height_units(heights, threshold=3.0) # This is our original function
fixed_heights_vectorized = fix_height_units_vectorized(heights, threshold=3.0) # This is our vectorized function

In [55]:
np.allclose(fixed_heights, fixed_heights_vectorized) # Returns True if the two arrays are element-wise equal within a tolerance, False otherwise

True

Now let's test the following hypothesis: 

"Is the true mean that a given sample represents different from the population mean? Assume that the standard deviation of the data is $\sigma = 10$. Use a 95% confidence interval."

To do that, we first need to calculate the population mean by applying the `np.mean()` function to all the heights.
Then we need to calculate the sample means, either by applying `np.mean()` to each sample separately or by passing
the `axis=1` argument to `np.mean()` (this treats each row of the array as one sample).
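
The effect of `axis=1` is easiest to see on a tiny array. The data below are made up, and the sketch assumes that each row holds one sample.

```python
import numpy as np

# Made-up data: 2 samples (rows) with 3 heights each
demo = np.array([[160.0, 170.0, 180.0],
                 [150.0, 160.0, 170.0]])

# Without axis: one mean over all six values
population_mean = np.mean(demo)
print(population_mean)  # 165.0

# With axis=1: one mean per row, i.e. per sample
sample_means = np.mean(demo, axis=1)
print(sample_means.tolist())  # [170.0, 160.0]
```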

Given that we know the standard deviation, the sample size and the confidence level, we can use the formula
for the z-score:

$$z_{\bar{x}} = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{n}}}$$

where $\bar{x}$ is the sample mean, $\mu$ is the population mean, $\sigma$ is the standard deviation and $n$ is the sample size (the number of heights in each sample).
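
As a worked sketch of the formula, the example below uses made-up numbers (`mu`, `n` and the two sample means are assumptions, not values from `heights.csv`). It compares $|z|$ with 1.96, the standard two-sided critical value of the normal distribution at the 95% level.

```python
import numpy as np

# Assumed values for illustration only
mu = 170.0                              # population mean
sigma = 10.0                            # known standard deviation
n = 25                                  # heights per sample
sample_means = np.array([172.0, 165.0])

# z-score of each sample mean: (x_bar - mu) / (sigma / sqrt(n))
z = (sample_means - mu) / (sigma / np.sqrt(n))
print(z.tolist())  # [1.0, -2.5]

# Two-sided test at the 95% level: significant if |z| > 1.96
z_critical = 1.96
is_different = np.abs(z) > z_critical
print(is_different.tolist())  # [False, True]
```

With these numbers, only the second sample mean differs significantly from the population mean.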

In [7]:
# Put your code here

In [8]:
# Put your code here